stLDA-C: A Topic Model for Short Texts

Author: Zhiqiang Ji

Published: November 26, 2023

Introduction: Topic Modeling and LDA

Topic modeling in Natural Language Processing (NLP) is a technique used to discover hidden themes or topics within a collection of text documents. It’s an unsupervised machine learning technique, meaning it doesn’t require predefined tags or training data that’s been previously classified by humans. The main objective of topic modeling is to discover topics, each expressed as a cluster of strongly related words.

One popular algorithm for topic modeling is Latent Dirichlet Allocation (LDA). Topic modeling supports applications such as social media monitoring, email filtering, and hiring and recruitment, while NLP more broadly powers chatbots, autocorrection, speech recognition, language translation, and more.

What’s LDA?

Latent Dirichlet Allocation (LDA) is a generative probabilistic model for collections of discrete data such as text corpora. Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words. The topic proportions of a document are assumed to have a Dirichlet prior. The topic-specific word distributions also have a Dirichlet prior.

LDA diagram (credit: Think Infi)

LDA has these key assumptions:

  • Documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
  • Documents are exchangeable.

Requirements for the data:

  • Each document must represent a mixture of topics.
  • Each word must be generated from a single topic.

What’s LDA’s problem with short texts like tweets?

  • LDA is designed for long documents, but tweets are short.
  • LDA assumes that each document is a mixture of topics, but tweets are often about a single topic.
  • Short texts from social network platforms have other characteristics that LDA does not consider, such as users, hashtags, and mentions.

What fixes exist so far? What are their problems?

  • Merge all short documents by the same user into one long document
  • Use a single topic for each short document
    • Per-short-text LDA (a.k.a. Twitter-LDA)
    • Dirichlet-Multinomial Mixture (DMM)
  • Learn topics from longer documents (e.g., news articles) and apply them to short texts
  • Classify short texts using neural networks
    • The word mover’s distance (WMD)
    • Word embeddings
  • Clustering techniques

However, while there are methods available for analyzing short-text documents, they do have some limitations. Specifically, these methods do not retain user information and co-occurrence of words within the same short texts. Additionally, they require a corpus of long text that is already compatible with the short-text documents, and relying on pre-trained word embeddings may not accurately reflect the specific vocabulary and semantic usage of words in the short texts.
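
To make the first workaround and its cost concrete, here is a minimal base R sketch of pooling tweets by user. The data frame `tweets` and its columns `user` and `text` are made up for illustration; they are not the dataset analyzed later in this post.

```r
# Toy data: five short texts written by two hypothetical users
tweets <- data.frame(
  user = c("a", "a", "b", "b", "b"),
  text = c("vote early", "court ruling today",
           "health care now", "care act", "protect health"),
  stringsAsFactors = FALSE
)

# Pool every user's tweets into one long pseudo-document
pooled <- aggregate(text ~ user, data = tweets, FUN = paste, collapse = " ")
print(pooled)
```

After pooling, each user is a single document that standard LDA can handle, but the tweet boundaries are gone: the model can no longer tell that "vote" and "early" appeared in the same tweet while "court" appeared in a different one.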

Introducing the stLDA-C Model

The stLDA-C model was proposed by Tierney et al. in their paper “Author Clustering and Topic Estimation for Short Texts.” This model particularly aims to improve topic estimation in brief documents, such as social media posts, and incorporates the grouping of authors for more effective analysis.

stLDA-C features:

  • A short-text LDA topic model with unsupervised clustering of the authors of short documents, fusing the clustering of both authors (users) and documents
  • A hierarchical model capable of sharing information at multiple levels, leading to higher-quality estimates of per-author topic distributions, per-cluster topic distribution centers, and author cluster assignments

The stLDA-C model is specifically designed to handle the sparsity of words in short texts by considering the additional structure provided by user clusters and potentially by integrating external information or employing different priors that are more suitable for short texts.

What’s new in the stLDA-C model?

To understand what’s new in the stLDA-C model, let’s first take a closer look at the traditional LDA model.

[Figures: model diagrams for traditional LDA and stLDA-C]

Quick summary of the traditional LDA notations:

W: Word

Z: Topic

LDA Input:

  1. M number of documents
  2. Each of these documents has N words

LDA Output:

  1. K number of topics (cluster of words)
  2. Φ distribution (document to topic distribution)

Compared with the traditional LDA, the stLDA-C model adds a layer of user clustering and a layer of hierarchical topic distributions. From the diagrams, we can see that the stLDA-C model introduced several changes and additions:

  1. The model considers \(G\) clusters of users, where \(G\) is a hyperparameter.
  2. \(G_u\) is the cluster assignment of user \(u\), drawn according to \(\phi\).
  3. \(\alpha_g\) is the parameter vector of a Dirichlet distribution over topic choices for users in cluster \(g\).
  4. \(\phi\) is the distribution over user clusters: it encodes the proportion of users in each cluster and forms the prior distribution for \(G_u\). Traditional LDA has no concept of user clusters, so \(\phi\) is specific to stLDA-C.
  5. \(\theta_u\): Because the model assumes that each document (tweet) is generated by a single topic, the consideration for the document-topic distribution is replaced by user-topic distribution. Each user-specific topic distribution \(\theta_u\) is a draw from \(Dir(\alpha_g)\), where \(g\) is the cluster assignment of user \(u\).
  6. \(Z_{ud}\) is the topic of each tweet \(d\) by user \(u\). \(Z_{ud}\) is a single draw from \(\theta_u\), and all words in tweet \(ud\) are sampled from the topic distribution over words, \(\beta_t\), where \(Z_{ud} = t\).

The generative process of the stLDA-C model is as follows:

Generative process of the stLDA-C model
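
Using only the components listed above, the sampling steps can be written compactly as follows. This is a sketch rather than the paper's full specification: the index ranges \(U\) (number of users), \(D_u\) (tweets by user \(u\)), \(N_{ud}\) (words in tweet \(ud\)), and the word symbol \(W_{udn}\) are notation assumed here, and the priors on \(\phi\), \(\alpha_g\), and \(\beta_t\) are left out because this post does not state them.

```latex
\begin{aligned}
G_u \mid \phi &\sim \mathrm{Categorical}(\phi), & u &= 1, \dots, U \\
\theta_u \mid G_u = g &\sim \mathrm{Dir}(\alpha_g) \\
Z_{ud} \mid \theta_u &\sim \mathrm{Categorical}(\theta_u), & d &= 1, \dots, D_u \\
W_{udn} \mid Z_{ud} = t &\sim \mathrm{Categorical}(\beta_t), & n &= 1, \dots, N_{ud}
\end{aligned}
```

The key difference from traditional LDA is visible in the last two lines: the topic \(Z_{ud}\) is drawn once per tweet, not once per word, so every word in a tweet shares the same topic.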

TL;DR

Very intimidating, right? Let’s break it down:

stLDA-C model workflow
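
As a complement to the workflow figure, the same story can be simulated in a few lines of base R. This is a minimal sketch under made-up settings: the vocabulary, the sizes, the \(\alpha_g\) values, and the Dirichlet draws used for \(\phi\) and \(\beta_t\) are all illustrative assumptions, and the code generates toy data only; it is not the authors' sampler.

```r
set.seed(42)

# Draw one sample from a Dirichlet distribution via normalized Gamma variates
rdirichlet1 <- function(alpha) {
  x <- rgamma(length(alpha), shape = alpha, rate = 1)
  x / sum(x)
}

n_clusters <- 2  # G: number of user clusters (hyperparameter)
n_topics   <- 3  # number of topics
n_users    <- 6
n_tweets   <- 4  # tweets per user
n_words    <- 5  # words per tweet
vocab <- c("vote", "court", "health", "care", "police", "law", "jobs", "tax")

# phi: proportion of users in each cluster (illustrative draw)
phi <- rdirichlet1(rep(1, n_clusters))

# alpha_g: one Dirichlet parameter vector over topics per cluster (rows = clusters)
alpha <- rbind(c(8, 1, 1), c(1, 1, 8))

# beta_t: one distribution over the vocabulary per topic (rows = topics)
beta <- t(replicate(n_topics, rdirichlet1(rep(0.5, length(vocab)))))

# G_u: cluster assignment of each user, drawn from phi
G_u <- sample(n_clusters, n_users, replace = TRUE, prob = phi)

# theta_u: per-user topic distribution, drawn from Dir(alpha_g) of the user's cluster
theta <- t(sapply(G_u, function(g) rdirichlet1(alpha[g, ])))

# Z_ud: a single topic per tweet; every word in the tweet is drawn from beta[Z_ud, ]
tweets <- do.call(rbind, lapply(seq_len(n_users), function(u) {
  z <- sample(n_topics, n_tweets, replace = TRUE, prob = theta[u, ])
  text <- vapply(z, function(k) {
    paste(sample(vocab, n_words, replace = TRUE, prob = beta[k, ]), collapse = " ")
  }, character(1))
  data.frame(user = u, cluster = G_u[u], topic = z, text = text)
}))

head(tweets)
```

Fitting the model runs this process in reverse: given only the `user` and `text` columns, it tries to recover the hidden `cluster` and `topic` columns along with `theta`, `alpha`, and `beta`.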

Three key takeaways from the stLDA-C model:

  1. User Clustering: stLDA-C clusters users by topic preferences, enhancing the analysis of datasets where authorship is significant.
  2. Hierarchical Topic Distributions: The model employs hierarchical priors for nuanced cluster-level and user-specific topic analysis.
  3. Integrated Topic-User and Word Analysis: stLDA-C combines topic-user dynamics with word co-occurrence for comprehensive short text analysis.

The stLDA-C Model in Action: Analyze US Congress members’ tweets

We gathered tweets from U.S. Congress members posted from August 16 to September 26, 2020. We focused on the 60 most active Twitter users among them and, for each, selected the top 20 tweets ranked by combined counts of likes and retweets. Among these 60 Congress members, 37 are Democrats and 22 are Republicans.

Visualize the networks of tweets and users

First, we visualize the networks of words and users. With node_type = "groups", the nodes are users and an edge links two users whose tweets share words; with node_type = "words", the nodes are words. In the following plot of users, node size is proportional to betweenness centrality. Here we present a static and an interactive version of the user network.
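
The idea of linking users through shared words can be sketched in base R with a toy user-by-word matrix. The three users and four words below are invented, and raw shared-word counts are used only for illustration; textnets may weight edges differently.

```r
# Toy incidence matrix: 1 if the user used the word at least once
incidence <- rbind(
  user_a = c(vote = 1, court = 1, health = 0, care = 0),
  user_b = c(vote = 1, court = 0, health = 1, care = 1),
  user_c = c(vote = 0, court = 0, health = 1, care = 1)
)

# Project onto users: entry [i, j] counts the words shared by users i and j
shared <- tcrossprod(incidence)
diag(shared) <- 0  # drop self-links
shared
```

Here user_b and user_c share two words, user_a and user_b share one, and user_a and user_c share none, so the projected network has a strong b-c edge, a weaker a-b edge, and no a-c edge.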

Code
# Load the packages used below
library(textnets)
library(igraph)

# Prepare the text: one version with words as nodes, one with users as nodes
tweets_w <- PrepText(top_tweets, groupvar = "screen_name", textvar = "text_com",
                     node_type = "words", tokenizer = "words", pos = "nouns",
                     remove_stop_words = TRUE, compound_nouns = FALSE)
tweets_g <- PrepText(top_tweets, groupvar = "screen_name", textvar = "text_com",
                     node_type = "groups", tokenizer = "words", pos = "nouns",
                     remove_stop_words = TRUE, compound_nouns = FALSE)

# Create the networks
tweets_w_nw <- CreateTextnet(tweets_w)
tweets_g_nw <- CreateTextnet(tweets_g)

# Save the networks to local files
saveRDS(tweets_w_nw, "rcs/data/tweets_w_nw.rds")
saveRDS(tweets_g_nw, "rcs/data/tweets_g_nw.rds")

# Check the degree distribution of the word network
# (stored as word_degree so the igraph function degree() is not overwritten)
word_degree <- igraph::degree(tweets_w_nw)
hist(word_degree, breaks = 100, main = "Degree Distribution of Words", xlab = "Degree")

Code
# ## Check number of nodes and edges
# vcount(tweets_w_nw)
# ecount(tweets_w_nw)
# vcount(tweets_g_nw)
# ecount(tweets_g_nw)

VisTextNet(tweets_g_nw, alpha = 0.25, label_degree_cut=10, betweenness=TRUE)

Code
VisTextNetD3(tweets_g_nw, alpha = 0.2, charge=-50,zoom = TRUE)

The network visualization reveals that the 60 users are grouped into six clusters. However, these groupings do not necessarily align with the cluster estimates provided by the stLDA-C model.

Code
library(dplyr)
library(igraph)
library(networkD3)

## Build a readable subset of "tweets_w": the top words of the top users
# Select the 10 most frequent lemmas for each user
top_lemmas_per_user <- tweets_w %>%
  group_by(screen_name) %>%
  slice_max(order_by = count, n = 10, with_ties = FALSE)

# Select the 25 users with the highest total count across those lemmas
top_users <- top_lemmas_per_user %>%
  group_by(screen_name) %>%
  summarise(total_count = sum(count)) %>%
  arrange(desc(total_count)) %>%
  slice_head(n = 25) %>%
  ungroup()

# Subset the top lemmas to only include these top users
final_subset <- top_lemmas_per_user %>%
  filter(screen_name %in% top_users$screen_name)

# Rename "screen_name" to "groupvar" and keep only the user and lemma columns
tweets_data <- final_subset %>%
  ungroup() %>%
  rename(groupvar = screen_name) %>%
  select(groupvar, lemma)

# Create edges between users and lemmatized words (one row per user-lemma pair)
edges_user_word <- tweets_data %>%
  distinct()

# Convert to edge list and create an undirected graph
edge_list <- as.data.frame(edges_user_word)
g <- graph_from_data_frame(edge_list, directed = FALSE)

# Mark user nodes (TRUE) versus word nodes (FALSE)
V(g)$type <- V(g)$name %in% tweets_data$groupvar

# Convert to networkD3 format and label each node as a user or a word
network_data <- igraph_to_networkD3(g)
network_data$nodes$group <- ifelse(network_data$nodes$name %in% tweets_data$groupvar, "User", "Word")

# Interactive plot
forceNetwork(Links = network_data$links, Nodes = network_data$nodes,
             Source = "source", Target = "target",
             NodeID = "name", Group = "group",
             zoom = TRUE, fontSize = 30, charge = -30,
             colourScale = JS("d3.scaleOrdinal().range(['#76b7b2', '#f28e2b'])"))

This interactive plot links the 25 users with the highest total word counts to each user’s 10 most frequent words. The nodes in blue represent users, and the nodes in orange represent words. This plot has nothing to do with the topic modeling; it simply shows how the users and their words form a two-mode (bipartite) network. The stLDA-C model aims to capture the distribution of topics over this kind of user-word structure.

Use stLDA-C to analyze US Congress members’ tweets

Now, we’ll use the stLDA-C model to analyze the tweets of the 60 Congress members. We use the same code that the authors of the stLDA-C model provide in their demo. We set the number of topics to 6, matching the number of clusters found in the network visualization. The number of user clusters is set to 2, reflecting the two party affiliations of the Congress members.

Code
#######################
### Visualizations ####
#######################

#print top 15 words from each topic
groundtruth_estimate[["tw"]] %>% 
  top_topic_words(words = words,n=15) %>% 
  t
      [,1]       [,2]       [,3]       [,4]       [,5]       [,6]      
 [1,] "campaign" "campaign" "american" "campaign" "campaign" "american"
 [2,] "country"  "care"     "campaign" "care"     "country"  "care"    
 [3,] "court"    "country"  "care"     "country"  "court"    "country" 
 [4,] "election" "days"     "country"  "court"    "election" "court"   
 [5,] "first"    "election" "court"    "election" "first"    "election"
 [6,] "get"      "first"    "election" "first"    "get"      "first"   
 [7,] "going"    "get"      "first"    "get"      "going"    "get"     
 [8,] "law"      "going"    "get"      "help"     "help"     "ginsburg"
 [9,] "must"     "health"   "help"     "must"     "must"     "going"   
[10,] "police"   "help"     "must"     "police"   "police"   "health"  
[11,] "right"    "right"    "november" "right"    "right"    "help"    
[12,] "today"    "thank"    "right"    "supreme"  "supreme"  "really"  
[13,] "voting"   "today"    "today"    "today"    "today"    "right"   
[14,] "want"     "voting"   "want"     "want"     "want"     "voting"  
[15,] "women"    "want"     "women"    "women"    "women"    "women"   
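
One way to quantify how similar these topics are is the Jaccard overlap between their top-word lists (shared words divided by all distinct words). The sketch below copies the six columns printed above into a list; the names `t1` to `t6` and the helper `jaccard()` are ours and not part of the stLDA-C code.

```r
# Top 15 words per topic, copied from the output above
top_words <- list(
  t1 = c("campaign", "country", "court", "election", "first", "get", "going", "law",
         "must", "police", "right", "today", "voting", "want", "women"),
  t2 = c("campaign", "care", "country", "days", "election", "first", "get", "going",
         "health", "help", "right", "thank", "today", "voting", "want"),
  t3 = c("american", "campaign", "care", "country", "court", "election", "first", "get",
         "help", "must", "november", "right", "today", "want", "women"),
  t4 = c("campaign", "care", "country", "court", "election", "first", "get", "help",
         "must", "police", "right", "supreme", "today", "want", "women"),
  t5 = c("campaign", "country", "court", "election", "first", "get", "going", "help",
         "must", "police", "right", "supreme", "today", "want", "women"),
  t6 = c("american", "care", "country", "court", "election", "first", "get", "ginsburg",
         "going", "health", "help", "really", "right", "voting", "women")
)

# Jaccard overlap: size of the intersection divided by size of the union
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Pairwise overlap matrix across the six topics
idx <- seq_along(top_words)
overlap <- outer(idx, idx, Vectorize(function(i, j) jaccard(top_words[[i]], top_words[[j]])))
dimnames(overlap) <- list(names(top_words), names(top_words))
round(overlap, 2)
```

Topics 4 and 5, for example, share 14 of their 15 top words, which gives a Jaccard overlap of 14/16 = 0.875. Values this close to 1 indicate that the topics are barely distinguishable by their top words.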
Code
#print cluster means with user-level topic estimates overlayed
#grey bars are cluster-level expected values, colored lines are each user's topic distribution
#note that clusters with 1 user do not visualize well

# Extract estimated cluster assignments from the model results
ca_est <- groundtruth_estimate[["ca"]]  %>% results_freq_table() %>% apply(1, which.max)

# Real data has no ground-truth cluster labels (ca_true), so the comparison below stays commented out
# table(ca_est, ca_true)

# Function to plot clusters
plot_clusters <- function(ut_mat, cluster_assignment, cluster_alphas, yRange = c(0, .5)) {
  cluster_means <- cluster_alphas %>% {./rowSums(.)}
  ut_mat <- ut_mat %>% {./rowSums(.)}
  
  lapply(unique(cluster_assignment), function(c) {
    ut_mat %>%
    {.[cluster_assignment == c, ]} %>%
      t %>%
      data.frame(Topic = 1:ncol(ut_mat), .) %>%
      reshape2::melt(id.vars = "Topic") %>%
      ggplot(aes(x = Topic, y = value)) +
      geom_line(aes(color = variable)) +
      guides(color = "none") +
      geom_bar(data = data.frame(x = 1:ncol(ut_mat), y = cluster_means[c, ]), aes(x = x, y = y), alpha = .5, stat = "identity") +
      labs(title = str_c("Cluster ", c, " (n=", sum(cluster_assignment == c), ")"), y = "Probability") +
      ylim(yRange)
  })
}

# Generate and arrange cluster plots
clusterPlots <- plot_clusters(ut_mat = groundtruth_estimate[["ut"]] %>% results_array_mean(),
                              cluster_assignment = groundtruth_estimate[["ca"]] %>% results_freq_table() %>% apply(1, which.max),
                              cluster_alphas = groundtruth_estimate[["alphag"]] %>% results_array_mean())

clusterPlots %>% gridExtra::grid.arrange(grobs = .)
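
The `apply(1, which.max)` step above picks each user's most frequent cluster across sampler iterations. The toy example below illustrates that idea in base R, assuming `results_freq_table()` tabulates, per user, how often each cluster label was sampled (we have not verified its internals); the `draws` matrix is invented.

```r
# Hypothetical posterior draws: rows = users, columns = saved sampler iterations,
# each entry is the cluster label sampled for that user at that iteration
draws <- rbind(
  user_a = c(1, 1, 2, 1, 1),
  user_b = c(2, 2, 2, 1, 2),
  user_c = c(1, 2, 2, 2, 2)
)
n_clusters <- 2

# Frequency table: how often each user was assigned to each cluster
freq <- t(apply(draws, 1, function(x) tabulate(x, nbins = n_clusters)))

# Modal (most frequent) cluster per user
ca_est <- apply(freq, 1, which.max)

# Share of draws that agree with the modal cluster: a rough certainty measure
certainty <- apply(freq, 1, max) / ncol(draws)

data.frame(cluster = ca_est, certainty = certainty)
```

Looking at the certainty column alongside the modal assignment can help separate users who sit firmly in one cluster from users the sampler keeps moving between clusters.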

Conclusion

Unfortunately, the stLDA-C model did not perform as well as we had hoped. The key words for the topics were not distinguishable enough to be useful, and the cluster assignment did not work on the 60 users. (For an ideal result, please refer back to the workflow diagram above.)

Despite the advanced capabilities of the stLDA-C model, obtaining ideal results from the analysis of 1200 tweets from 60 users proved challenging, primarily due to the inherent sparsity and brevity of tweets. It will take some trial and error to determine the optimal number of topics and clusters to use in the model.

Additionally, the effectiveness of stLDA-C is highly sensitive to text preprocessing choices and hyperparameter settings, which require meticulous tuning. To improve future analyses, a more extensive dataset could be beneficial, alongside a refined approach to preprocessing and an iterative process of parameter optimization to better capture the nuances of short text data.

